[pydap backend] enables downloading/processing multiple arrays within single http request #10629

Mikejmnez · 2025-08-12T18:46:58Z

Closes make pydap backend more opendap-like by downloading multiple variables in same http request #10628
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst

With this PR, the following is true:

import xarray as xr
from requests_cache import CachedSession

# Cache every HTTP request so the URLs that were actually fetched can be inspected
session = CachedSession(cache_name='debug')
session.cache.clear()

dap4urls = ["dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc",
            "dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc"]

ds = xr.open_mfdataset(dap4urls, engine='pydap', session=session, concat_dim='TIME', parallel=True, combine='nested', decode_times=False)

session.cache.urls()
>>>['http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D&dap4.checksum=true',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dmr',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc.dap?dap4.ce=COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D&dap4.checksum=true',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc.dmr']

So in DAP4 the dimensions are always batched (downloaded) together, in a single request per file.
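To check what was batched, the constraint expression in one of the cached `.dap` URLs above can be decoded with the standard library. This is a minimal sketch: the helper name `constraint_variables` is made up for illustration and is not part of xarray or pydap.

```python
from urllib.parse import parse_qs, urlsplit

# One of the cached URLs listed above (the .dap data request for the first file).
DAP_URL = (
    "http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap"
    "?dap4.ce=COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D"
    "&dap4.checksum=true"
)


def constraint_variables(dap_url):
    """Return the variables (with index slices) named in the dap4.ce parameter."""
    query = parse_qs(urlsplit(dap_url).query)  # parse_qs percent-decodes the values
    constraint = query["dap4.ce"][0]
    # The requested variables are separated by ';' in the constraint expression.
    return constraint.split(";")


print(constraint_variables(DAP_URL))
# ['COADSX[0:1:179]', 'COADSY[0:1:89]', 'TIME[0:1:11]']
```

All three dimensions appear in the same constraint expression, which is what a single batched request per file looks like on the wire.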

In addition, to preserve backwards compatibility, I added a backend argument batch=True | False. When batch=True, all non-dimension arrays are downloaded in the same response (ideal when streaming data to store locally).
When batch=False, which is the default, each non-dimension array is downloaded with its own HTTP request, as before. This is ideal in many data-exploration scenarios.

cache_session = CachedSession(cache_name='debug')

ds = xr.open_mfdataset(dap4urls, engine='pydap', session=cache_session, parallel=True, combine='nested', concat_dim="TIME", decode_times=False, batch=True)

len(cache_session.cache.urls())
>>> 4 # 1 dmr and 1 dap per file (2 files)

# triggers all non-dimension data to be downloaded in a single http request per file
ds.load()

len(cache_session.cache.urls())
>>> 6 # the previous 4, plus one extra request per file

When batch=False (the default), the last step (ds.load()) triggers a separate download for each non-dimension array.
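The request counts above follow a simple pattern. The sketch below is a back-of-the-envelope model, not code from xarray or pydap: the function `expected_requests` and its parameters are hypothetical, and it assumes one DMR request and one batched dimension request per file, with no retries or additional metadata requests.

```python
def expected_requests(n_files, n_data_vars, batch, loaded=True):
    """Estimate how many distinct URLs end up in the cache (simplified model).

    Assumptions (not taken from pydap's source): opening a file costs one
    .dmr request plus one .dap request that fetches all dimensions together;
    loading costs one .dap request per file when batch=True, and one per
    non-dimension variable otherwise.
    """
    per_file_open = 2  # 1 dmr + 1 dap for the batched dimensions
    if not loaded:
        return n_files * per_file_open
    per_file_load = 1 if batch else n_data_vars
    return n_files * (per_file_open + per_file_load)


# Two files, as in the example above; n_data_vars=4 is an illustrative value.
print(expected_requests(2, 4, batch=True, loaded=False))  # 4
print(expected_requests(2, 4, batch=True))                # 6
print(expected_requests(2, 4, batch=False))               # 12
```

Under this model the saving from batch=True grows with the number of non-dimension variables per file, which matches the motivation given for the option.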

These changes allow a more performant download experience with xarray+pydap. However, most of these changes depend on a yet-to-be-released version of pydap (3.5.6). I want to check that things go smoothly here before making a new release, since I may need to make a change to the backend base code. Update: pydap 3.5.6 has been released!

Mikejmnez · 2025-08-13T16:49:53Z

hmm - the test I see that fails (sporadically) concerns the following assertion:

Differing data variables:
L   group_1_var  (lon, lat) float64 16B ...
R   group_1_var  (lat, lon) float64 16B ...

where the dimensions of the group variable show up in reverse order ((lat,lon) vs (lon,lat)). Not sure if this is a pydap/PydapDataStore issue. I am now applying sorted in the get_dimensions method of the PydapDataStore. The local tests ran fine (so nothing broke), but this failing test never showed up in my local testing...

shoyer

Thanks @Mikejmnez !

shoyer · 2025-08-18T17:27:38Z

xarray/backends/pydap_.py

        timeout=None,
        verify=None,
        user_charset=None,
+        batch=False,


Would it make sense to have the default be batch=None, which means "use batching if possible"? This would expose these benefits to more users.

I am not sure I fully understand what you mean. Do you mean batch = None|dict, where the user specifies in the dict which variables to download together? Or do you mean batch if dap4?

batch = True|False is intended, at the moment, as a way to download (stream) data faster and to make workflows scalable (when having to aggregate 100s of URLs on the client side) by downloading multiple variables at once (single URL).

ds = xr.open_mfdataset(urls, engine='pydap', ...., batch=True)
ds.(<define_slice_here>).to_zarr # .to_netcdf or whatever...

and so per dataset, you get roughly a single dap url with all variables.

NOTE: I did change the default to batch = None, and I am up for allowing batch = None | dict to enable broader usage in the future. pydap could easily support the dict aspect. For now it is either *all* available variables or None.

batch = None|dict

I see the benefit of allowing batch = None|dict to specify which variables to download together. But with opendap URLs, you can already specify a constraint expression that restricts which variables of the original source file are accessed. For example:

new_url = base_url + "?dap4.ce=/var1;/var2;/....;/VarN"

where N <= M, and M is the number of variables in the original remote file.

(Note this is very different from xarray.Dataset.drop_variables, since xarray first parses all M variables and then discards the M-N unwanted ones, which is not very useful when M~O(1000) and N~O(1).)
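A constraint-expression URL like the one above can be assembled with a small helper. This is a sketch only: `with_constraint` is a hypothetical function, and `var1` and `var2` are placeholder names rather than variables known to exist in the remote file.

```python
def with_constraint(base_url, variables):
    """Append a DAP4 constraint expression selecting only the given variables."""
    if not variables:
        return base_url  # nothing to restrict: request the full dataset
    # Each variable is addressed by its fully qualified name, e.g. "/var1".
    constraint = ";".join("/" + name.lstrip("/") for name in variables)
    return base_url + "?dap4.ce=" + constraint


base_url = "dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc"
new_url = with_constraint(base_url, ["var1", "var2"])
print(new_url)
# dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc?dap4.ce=/var1;/var2
```

The resulting URL would then be passed to xarray in place of the original one, so that only the N selected variables are ever parsed.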

batch if dap4 (if possible)

This is a bit tricky. Some servers are configured to provide a single opendap URL for an aggregated view of the entire dataset (an .ncml). This applies to both the dap2 and dap4 protocols. For opendap servers in the cloud, this is not used (not sure if it is possible). And so batch=True makes most sense for the non-aggregated views of the dataset.

I think the danger would be using batch=True on an aggregated view of the dataset, as it would attempt to download all of it in a single request.

shoyer · 2025-08-18T17:35:40Z

This is a little concerning! Not sure how this could be a bug on the Xarray side, unless we're using the wrong API for getting variable dimensions from Pydap.

shoyer · 2025-08-18T21:44:15Z

I'm seeing the same error over here:
#10649

Not quite sure what to make of this, but seems to be a separate bug.

Mikejmnez · 2025-08-18T22:14:07Z

Thanks @shoyer ! I am participating all week in a hackathon, but I will try to check and address your comments as fast as I can :)

Mikejmnez · 2025-08-19T17:03:40Z

xarray/backends/pydap_.py


    def get_dimensions(self):
-        return Frozen(self.ds.dimensions)
+        return Frozen(sorted(self.ds.dimensions))


To potentially address the issue with dimensions in Datatree, where the lat/lon dimensions are inconsistently ordered, I applied sorted to the dimensions list that the backend gets directly from the Pydap dataset. Hopefully this little fix will make the failure go away, but I will keep checking this issue locally, and again after merging main into this PR (it has not failed once yet! knocks on wood)

This is only dataset level dimensions, not variable level dimensions.

At the dataset level, dimension order doesn't really matter, so I doubt this is going to fix the issue, unfortunately.
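The distinction can be illustrated with plain Python containers standing in for the pydap objects. Nothing here uses the real pydap API; the dictionaries and sizes are hypothetical stand-ins, chosen only to show that sorting dataset-level dimension names leaves each variable's own dimension order untouched.

```python
# Hypothetical stand-ins for a dataset's dimensions and one variable's dimensions.
dataset_dimensions = {"lon": 2, "lat": 1}        # dataset level: name -> size
variable_dims = {"group_1_var": ("lon", "lat")}  # variable level: ordered tuple

# sorted() on a mapping returns a list of its keys, so only the names are reordered.
sorted_names = sorted(dataset_dimensions)
print(sorted_names)                  # ['lat', 'lon']

# The variable still reports its dimensions in the original order, so a
# comparison against a (lat, lon) variable would still fail.
print(variable_dims["group_1_var"])                    # ('lon', 'lat')
print(variable_dims["group_1_var"] == ("lat", "lon"))  # False
```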

Mikejmnez · 2025-09-19T15:46:31Z

@shoyer I had a second go at this finally. Moved much of the logic to the backend.

Here is the current state of things:

This PR installs pydap from source. Why? I want to leave the door open for changes to the pydap backend that may arise from this PR, and include them in the new pydap release. Only when there is a general feeling that this PR is ready to be merged will I make a pydap release and revert to installing pydap from conda. More comments/requests for changes on this PR are welcome!
~~Failing test is unrelated to this PR. But I think I found the potential culprit in the dap4 metadata parser in pydap. Will spend today working on that. This needs to be fixed asap.~~

Mikejmnez · 2025-09-26T17:23:26Z

@shoyer This is ready for further reviewing.

Pydap has a new release that fixes a bug in the backend XML parser. I think some additional work may be needed in the next couple of weeks, but it is unrelated to this PR anyway...

~~I did not know what to make of Mypy fails, but these also fail on the main branch too~~. Fixed in #10792

Mikejmnez · 2025-09-30T20:48:14Z

@shoyer Let me know if there is any feedback, concerns, further reviewing, etc.

This PR enables a new (non-default) feature that was added to the pydap backend over the span of several months, namely the ability to download multiple variables within a single request, according to the opendap spec. Without this feature, each variable is downloaded separately, which does not take advantage of the opendap protocol and can make pydap unusable when each remote file has more than about 2-3 variables and there are more than 10 URLs to consolidate (for example via mds = xr.open_mfdataset and then mds.to_zarr or something).

This PR also makes it so that, when accessing data via the dap4 protocol, all dimensions are always downloaded within a single request by default. This is more performant than downloading each dimension with a separate request, and it again improves performance when "only opening" multiple remote files.

github-actions bot added the topic-backends, CI, dependencies, and io labels Aug 12, 2025

Mikejmnez changed the title ~~Pydap4 scale~~ [pydap backend] enables downloading/processing multiple arrays within single http request Aug 12, 2025

Mikejmnez marked this pull request as ready for review August 13, 2025 07:11

Mikejmnez mentioned this pull request Aug 18, 2025

make pydap backend more opendap-like by downloading multiple variables in same http request #10628

Open

shoyer reviewed Aug 18, 2025

Mikejmnez commented Aug 19, 2025

Mikejmnez force-pushed the pydap4_scale branch from b3c77a0 to aaa07c4 Compare September 18, 2025 20:17

Mikejmnez force-pushed the pydap4_scale branch from 1687221 to 20fb5cd Compare September 26, 2025 15:35

Mikejmnez force-pushed the pydap4_scale branch from 9c15100 to 6c45f50 Compare September 26, 2025 22:21

Mikejmnez added 12 commits September 30, 2025 13:34

update PydapArrayWrapper to support backend batching

0e345a6

update PydapDataStore to use backend logic in dap4 to batch variables all together in single dap url

4dbcd62

pydap-server it not necessary

007dac2

set batch=False as default

76faff6

set batch=False as default in datatree

16a9341

set batch=False as default in open groups as dict

6f8afb0

for flaky, install pydap from repo for now

70f500f

initial tests - quantify cached url

1ac0ab4

adds tests to datatree backend to assert multiple dimensions downloaded at once (per group)

3a79592

update testing to show number of download urls

a8fe8fe

simplified logic

1f65ef6

specify cached session debug name to actually cache urls

0d22358

Mikejmnez added 26 commits September 30, 2025 13:34

fix for mypy

3205515

user visible changes on whats-new.rst

cb33c28

impose sorted to get_dimensions method

f85a0b9

reformat whats-new.rst

263592d

revert to install pydap from conda and not from repo

9e5c785

expose checksum as user kwarg

73aa5a1

include checksums optional argument in whats-new

f59b57d

update to newest release of pydap via pip until conda install is available

6c354ca

use requests_cache session with retry-params when 500 errors occur

eb3fca5

update env yml file to use new pydap release via conda

16394aa

let pydap handle exceptions/warning

1b76e98

process dims at once, one per group

2ca8a4d

debug

652e5d6

revert what`s new from previous commit

8406494

enable data checker for batched deserialized data

a5c6ba2

temporarily install from source for testing - will revert to conda install after new release if no further change to backend

a111b0c

update whats new

fd84f63

update tests

5da338c

set batch=None as default

0c55e52

improve handling of dims vs dimensions deprecation warning

36ea456

update to use latest version of pydap

863ea6d

update import

fc72d32

update `whats new docs

049ff2e

move cache session to tmpdir

4f5a715

remove added functionality from whats new from newly released version

47b7e73

add to whats-new for next release

4b516b4

Mikejmnez force-pushed the pydap4_scale branch from aac3163 to 4b516b4 Compare September 30, 2025 20:38

Mikejmnez requested a review from shoyer September 30, 2025 20:41
